part of education series: connecting the dots

Solve Problems with Knowledge Graphs

by Paula Perez

A knowledge graph is a way to represent knowledge

It helps us visualize information in a more intuitive way

It helps to find objects

controlled vocabularies (MeSH); ontologies (Gene Ontology)

It answers questions

knowledge graphs on the Web; SPARQL query language

It generates hypotheses

knowledge plus computation = inference; ABC model

Objects in knowledge graphs are organized in ontologies

An ontology is defined as "the branch of metaphysics dealing with the nature of being"
In practice, it is a set of concepts, definitions and inter-relationships
There are hundreds of ontologies in life sciences

Example: Gene Ontology

Started in 1998 as a collaboration among three Model Organism Databases

GO is a set of concepts and their relationships to each other organized within a hierarchy.

GO is a way to capture biological knowledge for individual gene products in a way computers understand.
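The idea of a hierarchy that computers can follow can be sketched in a few lines. This is only an illustration: the parent links below are simplified (the real GO graph has intermediate terms), and the annotations and the gene product "protein X" are made up for the example.

```python
# Toy ontology: each term maps to its direct parents as (relation, parent) pairs.
# Edges are simplified for illustration and are NOT a copy of the real GO graph.
PARENTS = {
    "insulin receptor activity": [("is_a", "protein kinase activity")],
    "protein kinase activity": [("is_a", "molecular_function")],
    "mitochondrial matrix": [("part_of", "mitochondrion")],
    "mitochondrial inner membrane": [("part_of", "mitochondrion")],
    "mitochondrion": [("is_a", "cellular_component")],
}

# Hypothetical annotations of gene products to terms.
ANNOTATIONS = {
    "insulin receptor": ["insulin receptor activity"],
    "protein X": ["mitochondrial matrix"],
}


def ancestors(term):
    """Return every term reachable by following parent links upward."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in PARENTS.get(current, []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def annotated_to(annotations, term):
    """Gene products annotated to `term` itself or to any more specific term."""
    return sorted(
        product
        for product, terms in annotations.items()
        if any(t == term or term in ancestors(t) for t in terms)
    )


print(annotated_to(ANNOTATIONS, "protein kinase activity"))
print(annotated_to(ANNOTATIONS, "mitochondrion"))
```

Because the hierarchy is explicit, a search for a general term also finds gene products annotated with a more specific term.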


Ontology Slide credit: Mélanie Courtot, Ph.D.

https://www.ebi.ac.uk/QuickGO/

Figure labels:
· protein kinase activity
· insulin receptor activity
· mitochondrion
· mitochondrial matrix
· mitochondrial inner membrane
insulin receptor
insulin

GO Branches

An ontology like GO allows us to define different aspects of biological knowledge

Molecular Function

An elemental activity or task or job

Biological Process

A commonly recognized series of events

Cellular Component

Where a gene product is located

WISDOM: Knowing what to do with Drug A ...
KNOWLEDGE: Drug A caused Gene X to be expressed
INFORMATION: Gene X is expressed
DATA: RNASeq reads

But at times, an ontology like GO is not enough

Knowledge graphs, or knowledge bases, go one step further than ontologies or databases.

A knowledge graph is an integrated collection of claims that can be represented as a graph.

Knowledge graphs support communication between machines and humans, and they allow more organic and implicit information to be interpreted.

Why knowledge graphs?

To answer explicit questions from the user.

To uncover implicit relationships that are "hidden" within daily language and context.

To go deeper in the information provided to the user.

Some famous knowledge graphs

Wikidata: the Wikimedia Foundation knowledge graph

UniProt Protein knowledge base

Google Knowledge Graph

PubChem from US National Institutes of Health (NIH)

Google Knowledge Graph

search for: Vemurafenib

2,310,000 results

1 infobox

1 node in Google Knowledge Graph

Why use a Knowledge Graph?

Answer explicit questions; uncover implicit relations


Implicit relations for hypothesis generation

ABC model

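Figure: ABC model example (nodes labeled A, B, C)
Raynaud’s Syndrome --co-occurs in an article with--> B
B --co-occurs in an article with--> Dietary fish oil
Raynaud’s Syndrome --?--> Dietary fish oil
B: platelet inhibition · vasodilation · lower blood viscosity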

Open Discovery and Closed Discovery

Open discovery: you start from A and do not know what B or C is (e.g. disease -> ?drug)

Closed discovery: you know A and C and are looking for the B that connects them (e.g. disease – why? – drug)
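Both kinds of discovery can be tried on a toy co-occurrence table. The sketch below assumes that "linked" simply means "co-occurs in at least one article"; the pairs are invented for illustration (including the placeholder "drug Y") and are not mined from the literature.

```python
from collections import defaultdict

# Invented co-occurrence pairs: the two terms appear in the same article.
PAIRS = [
    ("Raynaud's syndrome", "platelet inhibition"),
    ("Raynaud's syndrome", "vasodilation"),
    ("Raynaud's syndrome", "lower blood viscosity"),
    ("dietary fish oil", "platelet inhibition"),
    ("dietary fish oil", "vasodilation"),
    ("dietary fish oil", "lower blood viscosity"),
    ("drug Y", "platelet inhibition"),
    ("drug Y", "Raynaud's syndrome"),  # already directly linked, so not a new idea
]


def build_index(pairs):
    """Map each term to the set of terms it co-occurs with."""
    index = defaultdict(set)
    for left, right in pairs:
        index[left].add(right)
        index[right].add(left)
    return index


def open_discovery(index, start):
    """Only the start term is known: find terms linked to it solely through
    intermediate terms, and report those intermediates."""
    direct = index[start]
    candidates = defaultdict(set)
    for bridge in direct:
        for other in index[bridge]:
            if other != start and other not in direct:
                candidates[other].add(bridge)
    return dict(candidates)


def closed_discovery(index, start, end):
    """Both ends are known: return the intermediate terms connecting them."""
    return index[start] & index[end]


index = build_index(PAIRS)
print(open_discovery(index, "Raynaud's syndrome"))
print(closed_discovery(index, "Raynaud's syndrome", "dietary fish oil"))
```

Open discovery proposes new end points; closed discovery explains a link between two end points that are already suspected to be related.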

Figure labels: A, B, C, ?
DRUG: Name; Registered Indication; New Indication; repurposing potential
DISEASE: Known Drug in DrugBank; Number of Medline Abstracts; Number in Clinical Trial Registry; Repurposing Potential

Example question: drug repurposing

For a given drug, what diseases might it be used to treat?


Implicit relations for hypothesis generation

ABC model for drug repurposing

Figure (nodes labeled A, B, C): drug --physical interaction--> genes --genetic association--> disease; drug --?--> disease
Asking for: Metformin --treats--> ?disease
Constraints: Metformin, treats
Result: Type 2 diabetes

Query a knowledge graph with SPARQL

SPARQL: SPARQL Protocol and RDF Query Language

RDF: Resource Description Framework (a common standard for representing knowledge graphs)

A SPARQL query = a partially completed graph

?'s show what you are looking for
remainder constrains the search
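The "partially completed graph" idea can be imitated without any SPARQL engine. The sketch below is a toy matcher over a hand-written list of triples; it handles a single triple pattern only, and the data and function names are invented for illustration.

```python
# A tiny knowledge graph as (subject, predicate, object) triples.
TRIPLES = [
    ("Metformin", "treats", "Type 2 diabetes"),
    ("Metformin", "instance of", "medication"),
    ("Vemurafenib", "treats", "melanoma"),
]


def is_variable(term):
    """Terms starting with '?' are what we are looking for."""
    return isinstance(term, str) and term.startswith("?")


def match(pattern, triples):
    """Yield one {variable: value} binding per triple that fits the pattern."""
    for triple in triples:
        binding = {}
        for wanted, actual in zip(pattern, triple):
            if is_variable(wanted):
                # A variable used twice must bind to the same value both times.
                if binding.setdefault(wanted, actual) != actual:
                    break
            elif wanted != actual:
                # A fixed term constrains the search.
                break
        else:
            yield binding


print(list(match(("Metformin", "treats", "?disease"), TRIPLES)))
print(list(match(("?drug", "treats", "?disease"), TRIPLES)))
```

The fixed terms narrow the search and each "?" collects an answer, which is the same division of labor a SPARQL query uses.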

Using Wikidata, Metformin’s unique id: Q19484
treats property id: P2175

Metformin --treats--> ?disease
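Written out as text, this graph pattern might look like the query built below. This is a sketch: it uses the two Wikidata ids quoted above and the standard Wikidata prefixes wd: and wdt:, and the helper name treats_query is made up for the example.

```python
# Wikidata ids quoted in the text above.
METFORMIN = "Q19484"  # item id for Metformin
TREATS = "P2175"      # property id for "treats"

# Standard Wikidata namespaces for items and direct properties.
PREFIXES = (
    "PREFIX wd: <http://www.wikidata.org/entity/>\n"
    "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
)


def treats_query(drug_id, property_id=TREATS):
    """Build a SPARQL query asking which diseases a drug treats."""
    return (
        PREFIXES
        + "SELECT ?disease WHERE {\n"
        + f"  wd:{drug_id} wdt:{property_id} ?disease .\n"
        + "}"
    )


print(treats_query(METFORMIN))
```

The single line inside the braces is the graph pattern: two fixed terms and one variable.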

Adding human readable labels to the SPARQL query

Metformin --treats--> ?disease
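One way to get readable labels from Wikidata is its label service. The sketch below shows a query of that shape and a small helper that pulls labels out of a SPARQL JSON result. The helper names, the User-Agent string and the sample result are assumptions made for this example; run_query needs network access and is defined but not called here.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

LABEL_QUERY = """
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX bd: <http://www.bigdata.com/rdf#>

SELECT ?disease ?diseaseLabel WHERE {
  wd:Q19484 wdt:P2175 ?disease .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""


def run_query(query, endpoint=ENDPOINT):
    """Send the query over HTTP and return the decoded JSON result."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/sparql-results+json",
            "User-Agent": "kg-tutorial-example/0.1",  # hypothetical client name
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def disease_labels(result):
    """Extract the human-readable labels from a SPARQL JSON result."""
    return sorted(row["diseaseLabel"]["value"] for row in result["results"]["bindings"])


# Hand-written sample in the SPARQL JSON results shape (trimmed to the label
# field), so the parsing step can be tried without network access.
SAMPLE = {
    "results": {
        "bindings": [
            {"diseaseLabel": {"type": "literal", "value": "type 2 diabetes"}},
        ]
    }
}

print(disease_labels(SAMPLE))
```

With network access, disease_labels(run_query(LABEL_QUERY)) would return whatever labels Wikidata holds at the time of the query.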

Our question:

“For a given drug, what diseases might it be used to treat?”

Figure labels:
mode of action: Agonist; Antagonist; Other; Unknown
GENE: SNP; symbol
association: P-Value; Odds Ratio; GWAS or PheWAS
DRUG: Name; Registered Indication; New Indication; repurposing potential
DISEASE: Known Drug in DrugBank; Number of Medline Abstracts; Number in Clinical Trial Registry; Repurposing Potential
Query pattern: drug --interacts with--> protein --encoded by--> gene --genetic association--> disease; drug --treats ?--> disease
Example: Metformin --interacts with--> Solute carrier family 22 member 3 --encoded by--> SLC22A3 --genetic association--> prostate cancer; Metformin --treats ?--> prostate cancer
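The chain drug, protein, gene, disease can be followed mechanically. The sketch below stores the example from the figure as triples and walks the three relations in order; the function names are invented, and a real knowledge graph would hold many more triples and weigh the evidence behind each one.

```python
# The example path from the figure, plus the known indication mentioned earlier.
TRIPLES = [
    ("Metformin", "interacts with", "Solute carrier family 22 member 3"),
    ("Solute carrier family 22 member 3", "encoded by", "SLC22A3"),
    ("SLC22A3", "genetic association", "prostate cancer"),
    ("Metformin", "treats", "Type 2 diabetes"),
]

# Relations to follow, in order, from a drug to a candidate disease.
PATH = ["interacts with", "encoded by", "genetic association"]


def follow(triples, start, predicates):
    """Return all nodes reached from `start` by following `predicates` in order."""
    frontier = {start}
    for predicate in predicates:
        frontier = {o for s, p, o in triples if p == predicate and s in frontier}
    return frontier


def repurposing_candidates(triples, drug):
    """Diseases reachable along PATH that the drug is not already known to treat."""
    known = {o for s, p, o in triples if s == drug and p == "treats"}
    return sorted(follow(triples, drug, PATH) - known)


print(repurposing_candidates(TRIPLES, "Metformin"))
```

The result is a hypothesis to investigate, not a claim that the drug treats the disease.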

Answer to our question:

“For a given drug, what diseases might it be used to treat?”



PubChem is a biochemistry knowledge base

The following pages show how to use PubChem to:

identify an item by measurement value
link an item to similar items
define measurement methods to link an item to similar items
identify a specific dataset from an investigation
identify a specific study and its evidence
provide additional information about a bioassay from an investigation

Identify a unique item

Atorvastatin (CID60823)
identified by InChIKey XUKUURHRXDUEBC-KAYWLYCHSA-N
with molecular weight given in molar mass units (obo:UO_0000055)
expressed as a double-precision floating-point value
also identified in ChEBI (SID103554720)
Figure labels:
Identifiers: compound:CID60823; descr:CID60823_Molecular_Weight; obo:UO_0000055; sio:CHEMINF_000334; substance:SID103554720; syno:MD5_9a05646d461669f86de312d88ab5748a; ChEBI:39548; sio:CHEMINF_000339
Literals: "558.639803"^^xsd:double; "atorvastatin"@en
Predicates: sio:has-attribute; sio:is-attribute-of; sio:has-unit; sio:has-value; rdf:type

Link an item to similar items

Express similarity:
by molecular formula & bonds
by isotopic composition
by connectivity
by 2-dimensional shape
by 3-dimensional shape
Figure labels:
Compounds: compound:CID60823; compound:CID10507504; compound:CID23665101; compound:CID10030610; compound:CID11330946
Predicates: sio:is_stereoisomer_of; sio:has_component
More about linking similar items.


Similarity to other substances
item-1
(3R,5R)-7-[2-(4-fluorophenyl)-3-phenyl-4-(phenylcarbamoyl)-5-propan-2-yl(314C)pyrrol-1-yl]-3,5-dihydroxyheptanoic acid
XUKUURHRXDUEBC-IEILOPHXSA-N
item-2
(3R,5R)-7-[2-cyclopropyl-5-(4-fluorophenyl)-4-phenyl-3-(phenylcarbamoyl)pyrrol-1-yl]-3,5-dihydroxyheptanoic acid
WDKDRVHIGPVFCW-KAYWLYCHSA-N
item-3
(E,3R,5S)-7-[5-[(3-chlorophenyl)carbamoyl]-3,4-bis(4-fluorophenyl)-1-propan-2-ylpyrrol-2-yl]-3,5-dihydroxyhept-6-enoic acid
JNJZDKOQTIGNNB-FDFRSNTISA-N
Has component
Atorvastatin sodium
VVRPOCPLIUDBSA-CNZCJKERSA-M

Define measurement to link similar items

Define items:
similar to Atorvastatin
by chemical structure
by fingerprint
using TanimotoScore
expressed as a double precision decimal
Figure labels:
Identifiers: compound:CID60823; compound:CID10030610; sio:CHEMINF_000333; nbr:CID60823_CID10030610_2DSimilarity; nbr:CID60823_CID10030610_2DTanimotoScore; vocab:2D_structural_similarity; vocab:2D_Fingerprint_TanimotoScore
Literal: "0.98"^^xsd:double
Predicates: rdf:type; sio:is-output-of; sio:has-measurement-value; sio:has-value
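A Tanimoto score compares two fingerprints as sets of "on" bits: the number of bits both share divided by the number of bits set in either. A minimal sketch follows; the two fingerprints are invented bit positions, not PubChem data, and returning 0.0 for two empty fingerprints is a choice made only for this example.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto score: shared 'on' bits divided by all 'on' bits in either set."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


# Invented fingerprints, given as the positions of the bits that are set.
fingerprint_1 = set(range(50))  # bits 0..49
fingerprint_2 = set(range(49))  # bits 0..48, so one bit differs

print(tanimoto(fingerprint_1, fingerprint_2))
```

Identical fingerprints score 1.0 and fingerprints with no bits in common score 0.0, so a value close to 1 indicates very similar structures.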

Identify a specific dataset from an investigation

A dataset about …
Atorvastatin
recorded for the ChEMBL data source
provided by the EBI organization
about a concept assigned the label:
“Biological Properties”
Figure labels:
Identifiers: substance:SID103554720; dcterms:Dataset; skos:Concept; skos:ConceptScheme; source:ChEMBL; source:European_Bioinformatics_Institute_EBI_ChEMBL; vocab:Biological_Properties; vocab:Substance_Categorization_Classification; http://www.ebi.ac.uk/chembldb/
Literal: "Biological Properties"
Predicates: sio:has-value; foaf:homepage; dcterms:source; pav:providedBy; rdf:type

Identify a specific study and its evidence

Study summarized as: “A new class of anti-inflammatory ester prodrugs”
published by DiscoveryGate
about SID103164874 (CHEBI:15365) and related compound – aspirin
based on a bioassay with the title “Inhibition of ovine COX1 by enzyme immunoassay”
Figure labels:
Identifiers: substance:SID103164874; endpoint:SID10316487_AID447528; measuregroup:AID447528; bao:measure-group; bao:ic50; ops:Micromolar; reference:PMID19500994; fabio:JournalArticle; http://www.ncbi.nlm.nih.gov/pubmed/19500994; protein:GI548481; pr:PR_000006933
Literals: "Prostaglandin G/H synthase 1"@en; "Inhibition of ovine COX1 by enzyme immunoassay"@en; "0.35"^^xsd:double
Predicates: rdf:type; dcterms:title; iao:is-about; qudt:unit

Provide additional information about a bioassay

Bioassay title: “Primary Cell-based assay for inhibitors of the Retinoic Acid Receptor-related orphan receptor A (RORA)”
summarized as: “probe development efforts to identify novel modulators of ROR” as measured by dose-response cell-based assay for RORA inhibitors
Figure labels:
Identifiers: bioassay:AID2139; bioassay:AID610; bioassay:AID561; measuregroup:AID610; bao:bioassay; bao:primary-assay; bao:confirmatory-assay; bao:summary-assay; bao:measure-group
Predicates: bao:has-assay-stage; bao:has-summary-assay; bao:has-measure-group; bao:is-measure-group-of; rdf:type

Links among PubChem subdomains

At the center: measurements and evidence (bioassays)
Biologics are proteins and genes from specific sources
Items are chemical compounds and substances
Evidence is documented in a study (reference)
Figure labels: similar protein; protein; biosystem; domain; neighbor; gene; measuregroup; endpoint; reference; bioassay; source; substance; synonym; descriptor; inchikey; compound; neighbor; component compound; parent compound; same connectivity; isotopomer; stereoisomer; similar compound

PubChem in a Single Picture

More about PubChem RDF

Conclusion

A knowledge graph helps us solve problems.

We used questions from life sciences to illustrate problem solving.

Knowledge graphs help us discover information that is often difficult to find.

Knowledge graphs help us connect the dots to see new possibilities and solutions.